v0.2.5: FP16/BF16 support, TF32 optimization, reduction ops#61
Merged
## NVRTC Error Handling
- Add NvrtcErrorCode enum (C++ and Python)
- NvrtcError now includes error code and compilation log
- Expose compilation_log property in Python exceptions

## PTX ISA Version Detection & Fallback
- Add get_recommended_arch() for automatic architecture selection
- Add get_fallback_archs() for fallback architecture list
- Auto-retry PTX loading with lower architectures on ISA mismatch

## Retry Logic for Transient Failures
- Retry OutOfMemory, InternalError up to 3 times
- Exponential backoff (100ms, 200ms, 400ms)

## JIT Warmup System
- Add warmup() function for pre-initializing NVRTC
- Support background warmup with callback
- Add is_warmup_done(), get_warmup_error() for status

## Driver Version Documentation
- Add get_driver_requirements() returning min requirements
- Add check_driver_compatibility() for compatibility check
- Minimum: CUDA 11.0+, SM 8.0 (Ampere)+

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
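The retry policy above (up to 3 attempts, 100ms/200ms/400ms backoff) can be sketched in plain Python. The exception class names below are stand-ins for the error kinds named in the commit; the real exception types live in pygpukit and may differ:

```python
import time

# Hypothetical stand-ins for the transient driver errors named above;
# the actual exception types are defined by pygpukit.
class OutOfMemory(RuntimeError): ...
class InternalError(RuntimeError): ...

def retry_transient(fn, *, attempts: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying transient failures with exponential backoff
    (100ms, 200ms, 400ms); the last failure is re-raised."""
    for attempt in range(attempts):
        try:
            return fn()
        except (OutOfMemory, InternalError):
            if attempt == attempts - 1:
                raise  # out of retries
            time.sleep(base_delay * (2 ** attempt))
```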
Add disk-based PTX cache with architecture fingerprinting:

Rust implementation (pygpukit-core):
- PersistentCache: disk-based PTX storage with JSON index
- ArchFingerprint: GPU characteristics for cache key (SM version, memory, driver)
- LRU eviction by entry count and total size
- TTL-based expiration with auto cleanup
- Serde serialization for persistence

Python bindings (pygpukit-python):
- ArchFingerprint: GPU architecture fingerprint
- PersistentCacheConfig: cache directory, size limits, TTL
- PersistentCache: insert, get, remove, cleanup, stats
- PersistentEntry: cached PTX with metadata
- PersistentCacheStats: hit rate, evictions, errors

Features:
- Architecture-aware cache keys (SM version + driver version)
- Automatic directory creation and index management
- Size-based and count-based eviction policies
- Cross-session cache persistence

Tests: 5 new tests for persistent_cache (119 total)
Implements missing elementwise operations:
- Binary ops: sub (a - b), div (a / b) - all dtypes
- Unary ops: exp, log, relu - float32/float64 only

Each operation includes:
- CUDA kernel implementations (f32, f64, i32, i64 for binary)
- pybind11 bindings with in-place variants
- Python wrappers with CPU fallback
- dtype validation for unary ops
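The CPU-fallback path with dtype validation can be sketched in pure Python; function and variable names below are illustrative, not the library's actual internals:

```python
import math

# Unary ops are restricted to float dtypes, mirroring the
# "float32/float64 only" validation described above.
_FLOAT_DTYPES = {"float32", "float64"}

def relu_cpu(values, dtype="float32"):
    if dtype not in _FLOAT_DTYPES:
        raise TypeError(f"relu supports float32/float64, got {dtype}")
    return [v if v > 0.0 else 0.0 for v in values]

def exp_cpu(values, dtype="float32"):
    if dtype not in _FLOAT_DTYPES:
        raise TypeError(f"exp supports float32/float64, got {dtype}")
    return [math.exp(v) for v in values]
```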
Research findings for CUTLASS-level TF32 kernel optimization:
- Swizzled shared memory layout (XOR-based bank conflict elimination)
- ldmatrix instruction usage and TF32 limitations
- Multi-stage pipeline considerations (4-stage vs current 2-stage)
- Recommended implementation order with expected gains
- Reference materials from NVIDIA CUTLASS and academic papers

Current: 27.38 TFLOPS → Target: 35+ TFLOPS
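The XOR-based swizzle can be demonstrated with a small bank-index model (pure Python, not kernel code): XOR-ing the column with the row remaps a column access, which would otherwise hit one bank 32 ways, across all 32 banks.

```python
BANKS = 32  # NVIDIA shared memory has 32 four-byte banks

def bank(row: int, col: int, width: int = 32, swizzle: bool = False) -> int:
    """Bank index of element (row, col) in a row-major width-wide tile."""
    if swizzle:
        col = col ^ (row % width)  # XOR swizzle of the column index
    return (row * width + col) % BANKS

# A warp reading column 0 of a 32-wide tile:
naive = {bank(r, 0) for r in range(32)}                 # all threads, same bank
swizzled = {bank(r, 0, swizzle=True) for r in range(32)}  # spread across banks
```

The same remapping applied inside the kernel's shared-memory layout removes the conflicts without the padding columns the current version uses.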
TF32 v2 kernel baseline (same structure as v1):
- BM=128, BN=128, BK=16
- 2-stage pipeline
- Padding for bank conflicts

Benchmark results (RTX 3090 Ti):
- 2048x2048: 11.12 TFLOPS
- 4096x4096: 20.46 TFLOPS
- 8192x8192: 29.12 TFLOPS

cuBLAS reference:
- 8192x8192: 41.79 TFLOPS

Current efficiency: 70% of cuBLAS
Target: 90% (37.6 TFLOPS)

Correctness: PASS (p99 < 2%)
Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.83 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 19.09 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 25.48 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: WMMA API is slower than PTX mma.sync (was ~29 TFLOPS). Need to return to PTX with better optimizations.
Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.36 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 17.86 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 24.26 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: BK=32 reduces occupancy. Need to optimize.
Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.94 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 18.85 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 24.93 TFLOPS (cuBLAS: 41.79)

Correctness: PASS
Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.83 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 18.75 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 25.22 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: 3-stage uses too much smem, reduces occupancy. v1 (2-stage) achieves 30.10 TFLOPS. Need different approach.
Benchmark results (RTX 3090 Ti):
- 2048x2048: 11.47 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 21.30 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 29.94 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Matches v1 baseline (30.10 TFLOPS). Need 25% improvement to reach 90% cuBLAS target.
Summary of attempts:
- 256x128 tile: 26.85 TFLOPS (occupancy issue)
- 64x256 tile: 21.33 TFLOPS (too much B loading)
- 3-stage BK=8: 27.44 TFLOPS (too many K iterations)
- v1 baseline: 30.10 TFLOPS (best so far)

cuBLAS reference: 41.79 TFLOPS @ 8192
Target: 37.6 TFLOPS (90%)

Next: try Split-K parallelization
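For reference, the Split-K idea mentioned above partitions the reduction dimension into independent slices whose partial products are summed afterward; on the GPU each slice would be a separate CTA, but the decomposition itself can be sketched in plain Python:

```python
def matmul_split_k(a, b, splits: int):
    """C = A @ B computed as `splits` independent partial matmuls over
    disjoint K ranges, then accumulated. Pure-Python illustration only."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    chunk = (k + splits - 1) // splits
    c = [[0.0] * n for _ in range(m)]
    for s in range(splits):  # each slice is independent, parallelizable work
        k0, k1 = s * chunk, min((s + 1) * chunk, k)
        for i in range(m):
            for j in range(n):
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(k0, k1))
    return c
```

The appeal for small-M/N, large-K shapes is that extra slices create more CTAs to fill the SMs, at the cost of the final reduction.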
Benchmark results (RTX 3090 Ti):
- 2048x2048: 12.16 TFLOPS
- 4096x4096: 21.64 TFLOPS
- 8192x8192: 30.45 TFLOPS (73% of cuBLAS)

Correctness: PASS (p99 rel error < 2%)

Target: 37.6 TFLOPS (90% of cuBLAS) - 24% gap remaining
Multiple optimization approaches tried:
- Double nested pipeline (GMEM→SMEM + SMEM→RMEM)
- Preloaded A fragments
- L1 caching (cp.async.ca vs .cg)
- BK=32 (too much smem)
- 3-stage pipeline
- Register double buffering

Benchmark results (RTX 3090 Ti):
- 2048x2048: ~12 TFLOPS
- 4096x4096: ~21 TFLOPS
- 8192x8192: ~29 TFLOPS (69% of cuBLAS 41.79)

Correctness: PASS (p99 rel error < 2%)
Target: 37.6 TFLOPS (90% of cuBLAS)

Additional optimizations needed to reach 90%:
- ldmatrix for efficient fragment loading
- Swizzled shared memory
- Different tile configurations
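The double-buffering pattern underlying these pipeline variants is: while the current buffer is consumed, the next tile is staged into the other buffer, which is the overlap cp.async provides on the GPU. A conceptual sketch (pure Python, not kernel code):

```python
def pipelined_sum(tiles):
    """Process tiles with two alternating buffers: stage tile i+1 while
    'computing' on tile i, mimicking a 2-stage load/compute pipeline."""
    buf = [None, None]
    buf[0] = tiles[0]  # prologue: preload the first tile
    out = []
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            buf[(i + 1) % 2] = tiles[i + 1]  # stage next tile ("load")
        out.append(sum(buf[i % 2]))          # consume current tile ("compute")
    return out
```

A 3-stage variant adds a third buffer to hide more load latency, which is exactly where the shared-memory budget noted above becomes the limiting factor.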
Types:
- Add float16, bfloat16 to Python dtypes
- Add Float16, BFloat16 to C++ DataType enum

Elementwise ops (FP16/BF16):
- add, mul, sub, div kernels with FP32 intermediate

Matmul:
- Add matmul_f16_bf16.cuh with simple kernels (FP32 accumulation)

Reduction ops:
- Add sum, mean, max (Python API + C++ placeholders)

Status: WIP - needs build verification and testing
- Add arithmetic operators: +, -, *, /, @
- Add astype() for dtype conversion
- Handle BF16 <-> FP32 conversion correctly

Test results (RTX 3090 Ti):
- FP16 elementwise: PASS
- BF16 elementwise: PASS
- FP16 matmul: PASS (rel error < 0.05)
- BF16 matmul: PASS (rel error < 0.05)
- Reduction ops (sum, mean, max): PASS

Benchmark (simple kernels, no TensorCore):
- FP16 4096x4096: 2.18 TFLOPS
- BF16 4096x4096: 2.16 TFLOPS
- FP32 4096x4096: 6.29 TFLOPS (reference)

Note: FP16/BF16 matmul uses naive kernels. TensorCore optimization planned.
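The BF16 <-> FP32 handling referred to above comes down to BF16 being the top 16 bits of an IEEE-754 float32: narrowing truncates the mantissa with round-to-nearest-even, widening is a plain shift. A minimal bit-level sketch (helper names illustrative, NaN handling omitted):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Narrow an FP32 value to BF16 bits with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # tie-breaks toward even
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(h: int) -> float:
    """Widen BF16 bits to FP32 by placing them in the high 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]
```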
- Add examples/demo_v025.py with full feature demonstration
- Update README.md with v0.2.5 features (FP16/BF16, reductions, operators)
- Add FP16/BF16 benchmark results to performance table
- Update roadmap: v0.2.5 released, v0.2.6+ planned

Demo output (RTX 3090 Ti):
- FP16/BF16 elementwise: PASS
- FP16/BF16 matmul: PASS
- Reduction ops: PASS

Benchmark (8192x8192):
- FP32: 12.7 TFLOPS
- TF32: 13.0 TFLOPS
- FP16: 2.3 TFLOPS (simple kernel)
- BF16: 2.2 TFLOPS (simple kernel)
Benchmarks all dtypes (FP32, TF32, FP16, BF16) with:
- Correctness verification
- Performance measurement (median TFLOPS)
- README-compatible markdown output
- Mode detection (Driver-Only vs Full JIT)

Usage: python benchmark_all.py [--sizes 2048,4096,8192] [--quick]

Results (RTX 3090 Ti, Full mode):

| Size | FP32 | TF32 | FP16 | BF16 |
|------|------|------|------|------|
| 2048 | 13.2 | 13.3 | 2.4  | 2.4  |
| 4096 | 22.6 | 23.6 | 2.4  | 2.4  |
| 8192 | 30.3 | 30.2 | 2.4  | 2.3  |
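For context, "median TFLOPS" figures like these are conventionally derived from the standard 2·n³ FLOP count for an n×n matmul and the median of the measured kernel times; the exact formula used by the script is an assumption here:

```python
import statistics

def matmul_tflops(n: int, times_s: list) -> float:
    """TFLOPS for an n x n x n matmul from a list of timings in seconds,
    using the median time to suppress warmup/jitter outliers."""
    flops = 2 * n ** 3  # one multiply + one add per inner-product step
    return flops / statistics.median(times_s) / 1e12
```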
- Rewrote benchmark_all.py to use env vars for TF32 kernel selection
- Added --tf32-version v1|v2 option
- Clarified Driver-Only vs JIT modes (same matmul performance)
- Updated README.md with accurate kernel-only timing results

Benchmark results (RTX 3090 Ti):
- FP32: 9.6 / 14.7 / 16.7 TFLOPS (2k/4k/8k)
- TF32 v2: 13.2 / 22.8 / 29.7 TFLOPS (2k/4k/8k)
- FP16: ~2.4 TFLOPS
- BF16: ~2.3 TFLOPS
- Renamed benchmark_all.py → benchmark.py
- Deleted redundant files: benchmark_tf32.py, benchmark_ampere.py, bench_tf32_v2.py
- Kept: benchmark_rust.py (scheduler), benchmark_pytorch.py (cuBLAS comparison)
- CLAUDE.md: add benchmark.py usage instructions
- README.md: update v0.2.5 highlights with TF32 v2 (~30 TFLOPS)
- Remove unused sys import
- Fix bare except clause
- Fix import sorting (I001) in multiple files
- Remove unused imports (F401)
- Remove unused f-string prefixes (F541)
- Add per-file-ignores for examples/ and compiler.py
- Fix mypy error: add type annotation for 'converted' variable
- Format all tracked Python files with ruff
- Add comprehensive PR checklist (lint, mypy, tests, benchmark)
Summary
Release candidate for v0.2.5 with the following features:
New Features
FP16/BF16 Data Types: Half-precision and brain floating point support
- `gpk.float16`, `gpk.bfloat16` dtypes
- `astype()` method for dtype conversion

Reduction Operations:
- `gpk.sum(a)` - Sum of all elements
- `gpk.mean(a)` - Mean of all elements
- `gpk.max(a)` - Maximum element

Operator Overloads:
- `a + b`, `a - b`, `a * b`, `a / b` - Elementwise ops
- `a @ b` - Matrix multiplication

Performance Improvements
Benchmark Results (RTX 3090 Ti)
Documentation
- `benchmark.py` usage instructions to CLAUDE.md
- `examples/demo_v025.py` demo script

Test Results